<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topic
  PUBLIC "-//OASIS//DTD DITA Composite//EN" "ditabase.dtd">
<topic id="topic1">
  <title id="ku137116">gpfdist</title>
  <body>
    <p>Serves data files to or writes data files out from Greenplum Database segments.</p>
    <section id="section2">
      <title>Synopsis</title>
      <codeblock>gpfdist [-d &lt;directory>] [-p &lt;http_port>] [-P &lt;last_http_port>] [-l &lt;log_file>]
   [-t &lt;timeout>] [-S] [-w &lt;time>] [-v | -V] [-s] [-m &lt;max_length>]
   [--ssl &lt;certificate_path> [--sslclean &lt;wait_time>] ]
   [-c &lt;config.yml>]

gpfdist -? | --help 

gpfdist --version</codeblock>
    </section>
    <section id="section3">
      <title>Description</title>
      <p><codeph>gpfdist</codeph> is Greenplum Database parallel file distribution program. It is
        used by readable external tables and <codeph>gpload</codeph> to serve external table files
        to all Greenplum Database segments in parallel. It is used by writable external tables to
        accept output streams from Greenplum Database segments in parallel and write them out to a
        file.</p>
      <note><codeph>gpfdist</codeph> and <codeph>gpload</codeph> are compatible only with the
        Greenplum Database major version in which they are shipped. For example, a
          <codeph>gpfdist</codeph> utility that is installed with Greenplum Database 4.x cannot be
        used with Greenplum Database 5.x or 6.x.</note>
      <p>In order for <codeph>gpfdist</codeph> to be used by an external table, the
          <codeph>LOCATION</codeph> clause of the external table definition must specify the
        external table data using the <codeph>gpfdist://</codeph> protocol (see the Greenplum
        Database command <codeph>CREATE EXTERNAL TABLE</codeph>). </p>
      <note>If the <codeph>--ssl</codeph> option is specified to enable SSL security, create the
        external table with the <codeph>gpfdists://</codeph> protocol.</note>
      <p>The benefit of using <codeph>gpfdist</codeph> is that you are guaranteed maximum
        parallelism while reading from or writing to external tables, thereby offering the best
        performance as well as easier administration of external tables.</p>
      <p>For readable external tables, <codeph>gpfdist</codeph> parses and serves data files evenly
        to all the segment instances in the Greenplum Database system when users
          <codeph>SELECT</codeph> from the external table. For writable external tables,
          <codeph>gpfdist</codeph> accepts parallel output streams from the segments when users
          <codeph>INSERT</codeph> into the external table, and writes to an output file.</p>
      <note>When <codeph>gpfdist</codeph> reads data and encounters a data formatting error, the
        error message includes a row number indicating the location of the formatting error.
          <codeph>gpfdist</codeph> attempts to capture the row that contains the error. However,
          <codeph>gpfdist</codeph> might not capture the exact row for some formatting
        errors.</note>
      <p>For readable external tables, if load files are compressed using <codeph>gzip</codeph> or
          <codeph>bzip2</codeph> (have a <codeph>.gz</codeph> or <codeph>.bz2</codeph> file
        extension), <codeph>gpfdist</codeph> uncompresses the data while loading the data (on the
        fly). For writable external tables, <codeph>gpfdist</codeph> compresses the data using
          <codeph>gzip</codeph> if the target file has a <codeph>.gz</codeph> extension.</p>
      <note type="note">Compression is not supported for readable and writeable external tables when
        the <codeph>gpfdist</codeph> utility runs on Windows platforms.</note>
      <p>When reading or writing data with the <codeph>gpfdist</codeph> or <codeph>gpfdists</codeph>
        protocol, Greenplum Database includes <codeph>X-GP-PROTO</codeph> in the HTTP request header
        to indicate that the request is from Greenplum Database. The utility rejects HTTP requests
        that do not include <codeph>X-GP-PROTO</codeph> in the request header.</p>
      <p>Most likely, you will want to run <codeph>gpfdist</codeph> on your ETL machines rather than
        the hosts where Greenplum Database is installed. To install <codeph>gpfdist</codeph> on
        another host, simply copy the utility over to that host and add <codeph>gpfdist</codeph> to
        your <codeph>$PATH</codeph>.</p>
      <note type="note">When using IPv6, always enclose the numeric IP address in brackets.</note>
    </section>
    <section id="section4">
      <title>Options</title>
      <parml>
        <plentry>
          <pt>-d <varname>directory</varname></pt>
          <pd>The directory from which <codeph>gpfdist</codeph> will serve files for readable
            external tables or create output files for writable external tables. If not specified,
            defaults to the current directory.</pd>
        </plentry>
        <plentry>
          <pt>-l <varname>log_file</varname></pt>
          <pd>The fully qualified path and log file name where standard output messages are to be
            logged.</pd>
        </plentry>
        <plentry>
          <pt>-p <varname>http_port</varname></pt>
          <pd>The HTTP port on which <codeph>gpfdist</codeph> will serve files. Defaults to
            8080.</pd>
        </plentry>
        <plentry>
          <pt>-P <varname>last_http_port</varname></pt>
          <pd>The last port number in a range of HTTP port numbers (<varname>http_port</varname> to
              <varname>last_http_port</varname>, inclusive) on which <codeph>gpfdist</codeph> will
            attempt to serve files. <codeph>gpfdist</codeph> serves the files on the first port
            number in the range to which it successfully binds.</pd>
        </plentry>
        <plentry>
          <pt>-t <varname>timeout</varname></pt>
          <pd>Sets the time allowed for Greenplum Database to establish a connection to a
              <codeph>gpfdist</codeph> process. Default is 5 seconds. Allowed values are 2 to 7200
            seconds (2 hours). May need to be increased on systems with a lot of network
            traffic.</pd>
        </plentry>
        <plentry>
          <pt>-m <varname>max_length</varname></pt>
          <pd>Sets the maximum allowed data row length in bytes. Default is 32768. Should be used
            when user data includes very wide rows (or when <codeph>line too long</codeph> error
            message occurs). Should not be used otherwise as it increases resource allocation. Valid
            range is 32K to 256MB. (The upper limit is 1MB on Windows systems.)</pd>
          <pd>
            <note>Memory issues might occur if you specify a large maximum row length and run a
              large number of <codeph>gpfdist</codeph> concurrent connections. For example, setting
              this value to the maximum of 256MB with 96 concurrent <codeph>gpfdist</codeph>
              processes requires approximately 24GB of memory (<codeph>(96 + 1) x
              246MB</codeph>).</note>
          </pd>
        </plentry>
        <plentry>
          <pt>-s</pt>
          <pd>
            <p>Enables simplified logging. When this option is specified, only messages with
                <codeph>WARN</codeph> level and higher are written to the <codeph>gpfdist</codeph>
              log file. <codeph>INFO</codeph> level messages are not written to the log file. If
              this option is not specified, all <codeph>gpfdist</codeph> messages are written to the
              log file. </p>
          </pd>
          <pd>You can specify this option to reduce the information written to the log file. </pd>
        </plentry>
        <plentry>
          <pt>-S (use O_SYNC)</pt>
          <pd>Opens the file for synchronous I/O with the <codeph>O_SYNC</codeph> flag. Any writes
            to the resulting file descriptor block <codeph>gpfdist</codeph> until the data is
            physically written to the underlying hardware.</pd>
        </plentry>
        <plentry>
          <pt>-w <varname>time</varname></pt>
          <pd>Sets the number of seconds that Greenplum Database delays before closing a target file
            such as a named pipe. The default value is 0, no delay. The maximum value is 7200
            seconds (2 hours). </pd>
          <pd>For a Greenplum Database with multiple segments, there might be a delay between
            segments when writing data from different segments to the file. You can specify a time
            to wait before Greenplum Database closes the file to ensure all the data is written to
            the file. </pd>
        </plentry>
        <plentry>
          <pt>--ssl <varname>certificate_path</varname></pt>
          <pd>Adds SSL encryption to data transferred with <codeph>gpfdist</codeph>. After running
              <codeph>gpfdist</codeph> with the <codeph>--ssl
              <varname>certificate_path</varname></codeph> option, the only way to load data from
            this file server is with the <codeph>gpfdist://</codeph> protocol. <ph>For information
              on the <codeph>gpfdist://</codeph> protocol, see "Loading and Unloading Data" in the
                <i>Greenplum Database Administrator Guide</i>.</ph></pd>
          <pd>The location specified in <varname>certificate_path</varname> must contain the
            following files:<ul id="ul_dyp_eqt_mo">
              <li id="ku140282">The server certificate file, <codeph>server.crt</codeph></li>
              <li id="ku140283">The server private key file, <codeph>server.key</codeph></li>
              <li id="ku140284">The trusted certificate authorities, <codeph>root.crt</codeph></li>
            </ul><p>The root directory (<codeph>/</codeph>) cannot be specified as
                <varname>certificate_path</varname>.</p></pd>
        </plentry>
        <plentry>
          <pt>--sslclean <varname>wait_time</varname></pt>
          <pd>When the utility is run with the <codeph>--ssl</codeph> option, sets the number of
            seconds that the utility delays before closing an SSL session and cleaning up the SSL
            resources after it completes writing data to or from a Greenplum Database segment. The
            default value is 0, no delay. The maximum value is 500 seconds. If the delay is
            increased, the transfer speed decreases.</pd>
          <pd>In some cases, this error might occur when copying large amounts of data:
              <codeph>gpfdist server closed connection</codeph>. To avoid the error, you can add a
            delay, for example <codeph>--sslclean 5</codeph>. </pd>
        </plentry>
        <plentry>
          <pt>-c <varname>config.yaml</varname></pt>
          <pd>Specifies rules that <codeph>gpfdist</codeph> uses to select a transform to apply when
            loading or extracting data. The <codeph>gpfdist</codeph> configuration file is a YAML
            1.1 document. </pd>
          <pd>For information about the file format, see <xref
              href="../../admin_guide/load/topics/transforming-xml-data.html#topic83" format="html" scope="external">Configuration File Format</xref> in the
              <cite>Greenplum Database Administrator Guide</cite>. For information about configuring
            data transformation with <codeph>gpfdist</codeph>, see <xref
              href="../../admin_guide/load/topics/transforming-xml-data.html#topic75" format="html" scope="external">Transforming External Data with gpfdist and gpload</xref> in the
              <cite>Greenplum Database Administrator Guide</cite>.</pd>
          <pd>This option is not available on Windows platforms.</pd>
        </plentry>
        <plentry>
          <pt>-v (verbose)</pt>
          <pd>Verbose mode shows progress and status messages.</pd>
        </plentry>
        <plentry>
          <pt>-V (very verbose)</pt>
          <pd>Verbose mode shows all output messages generated by this utility.</pd>
        </plentry>
        <plentry>
          <pt>-? (help)</pt>
          <pd>Displays the online help.</pd>
        </plentry>
        <plentry>
          <pt>--version</pt>
          <pd>Displays the version of this utility.</pd>
        </plentry>
      </parml>
    </section>
    <section>
      <title>Notes</title>
      <p>The server configuration parameter <xref
          href="../../ref_guide/config_params/guc-list.xml#verify_gpfdists_cert"
          >verify_gpfdists_cert</xref> controls whether SSL certificate authentication is enabled
        when Greenplum Database communicates with the <codeph>gpfdist</codeph> utility to either
        read data from or write data to an external data source. You can set the parameter value to
          <codeph>false</codeph> to disable authentication when testing the communication between
        the Greenplum Database external table and the <codeph>gpfdist</codeph> utility that is
        serving the external data. If the value is <codeph>false</codeph>, these SSL exceptions are
          ignored:<ul id="ul_vvb_5lj_mdb">
          <li>The self-signed SSL certificate that is used by <codeph>gpfdist</codeph> is not
            trusted by Greenplum Database.</li>
          <li>The host name contained in the SSL certificate does not match the host name that is
            running <codeph>gpfdist</codeph>.</li>
        </ul>
        <note type="warning">Disabling SSL certificate authentication exposes a security risk by not
          validating the <codeph>gpfdists</codeph> SSL certificate. </note></p>
      <p>You can set the server configuration parameter <xref
          href="../../ref_guide/config_params/guc-list.xml#gpfdist_retry_timeout"/> to control the
        time that Greenplum Database waits before returning an error when a <codeph>gpfdist</codeph>
        server does not respond while Greenplum Database is attempting to write data to
          <codeph>gpfdist</codeph>. The default is 300 seconds (5 minutes).</p>
      <p>If the <codeph>gpfdist</codeph> utility hangs with no read or write activity occurring, you
        can generate a core dump the next time a hang occurs to help debug the issue. Set the
        environment variable <codeph>GPFDIST_WATCHDOG_TIMER</codeph> to the number of seconds of no
        activity to wait before <codeph>gpfdist</codeph> is forced to exit. When the environment
        variable is set and <codeph>gpfdist</codeph> hangs, the utility is stopped after the
        specified number of seconds, creates a core dump, and sends relevant information to the log
        file.</p>
      <p>This example sets the environment variable on a Linux system so that
          <codeph>gpfdist</codeph> exits after 300 seconds (5 minutes) of no
        activity.<codeblock>export GPFDIST_WATCHDOG_TIMER=300</codeblock></p>
    </section>
    <section id="section6">
      <title>Examples</title>
      <p>To serve files from a specified directory using port 8081 (and start
          <codeph>gpfdist</codeph> in the background):</p>
      <codeblock>gpfdist -d /var/load_files -p 8081 &amp;</codeblock>
      <p>To start <codeph>gpfdist</codeph> in the background and redirect output and errors to a log
        file:</p>
      <codeblock>gpfdist -d /var/load_files -p 8081 -l /home/gpadmin/log &amp;</codeblock>
      <p>To stop <codeph>gpfdist</codeph> when it is running in the background:</p>
      <p>--First find its process id:</p>
      <codeblock>ps ax | grep gpfdist</codeblock>
      <p>--Then stop the process, for example:</p>
      <codeblock>kill 3456</codeblock>
    </section>
    <section id="section7">
      <title>See Also</title>
      <p><xref href="./gpload.xml#topic1" type="topic" format="dita"/>, <codeph>CREATE EXTERNAL
          TABLE</codeph><ph> in the <i>Greenplum Database Reference Guide</i></ph></p>
    </section>
  </body>
</topic>
